Modelling in R

Made for

Bush Heritage

Date

March 18, 2024

TODO - need to add in examples with real data

Statistical modelling in R for ecologists (Part 1)

Whenever you embark on a journey in statistical modelling, it’s good to be reminded of the following quote from George E.P. Box,

‘All models are wrong, but some are useful.’.

Photograph of Geoge E.P. Box, “one of the great statistical minds of the 20th century”

We build models to help us understand ecological relationships. But no model is perfect. Ecological data in particular is fraught with a lot of random, natural variation in observations that our models will be unable to explain.

Modelling goals can be three-fold:

  1. Exploration, e.g. which environmental variables are associated with species richness?,
  2. Inference, e.g., do protected areas cause increased species richness?,
  3. Prediction, e.g., use correlations between environmental variables and species abundance to predict species habitat suitability across space.

How we build and select models depends on our goals for the analysis. Check out this great guide to goal driven model selection in ecology.

Here, we will focus primarily on building exploratory and predictive models. (Causal inference is beyond the scope, but there is an excellent introduction for ecologists here).

We’ll start with a basic linear model to explore how disturbance influences ecological characteristics across survey sites. We’ll see how, depending on the type of data we are modelling, we might need to use generalised linear models (as opposed to general linear models). We’ll end by building more complex spatial and multivariate models in R.

This course assumes an understanding of how to wrangle, summarise, and visualise data in R. We hope you will learn:

  • The fundamentals of fitting statistical models to data,
  • how to check your model fits the data well, and
  • how to understand and visualise model output in R.

General linear models

We can use linear models to estimate the strength and direction of ecological relationships. The word ‘linear’ is important - when we use linear models we assume a straight (or linear) relationship between our response variable (also referred to as dependent variables) and our explanatory variables (also referred to as covariates, predictor, or independent variables - don’t let the different terms confuse you).

In addition to linearity, there are three other assumptions: homoskedasticity (constant variance) and normality of the residuals, and independence of observations. Don’t worry about those for now though, we’ll discuss them a bit later.

Explanatory variables can be either continuous (left plot below) or categorical (right plot below).

If you took a first year undergrad stats course, you might be thinking, let’s use a t-test/ANOVA for the categorical example above! A general linear model with a categorical explanatory variable is essentially the same thing - it will estimate the the mean difference between levels of a categorical variable.

One other important thing to note: although general linear models assumes a linear relationship between the response and explanatory variable, they can accommodate non-linearity via a non-linear transformation of the explanatory variable, e.g., this quadratic relationship between the explanatory and response.

A little bit of maths

To understand how they work, it helps to take a quick look at how we formulate linear models mathematically:

\[y_i = b_0 + b_{1}X_{1i} + \epsilon_i,\] \(y_i\) is our response variable of interest, \(b_0\) is the is the intercept for the linear relationship between \(y\) and our explanatory variables (\(X\)’s), \(b_1\) is the coefficient representing the strength and direction of the relationship between \(y\), and X_1 \(\epsilon\) is the random, unexplained error in our observations around the fitted relationship between \(y\) and \(X\)(s).

In the above example we only have one explanatory variable (\(X\)), but we can have multiple (referred to as ‘multiple linear regression’). We would just add on another term to our equation, e.g. \(b_{2}X_{2i}\).

When we’re fitting general linear models, we assume that \(\epsilon\), the error term, is drawn from a normal (i.e., gaussian) distribution. We can write that assumption mathematically as,

\[\epsilon \sim Normal(0, \sigma)\] We would expect this if our response variable (\(y\)) is continuous and unbounded (could potentially be any number between \(-∞\) and \(∞\)). In the next section, we’ll learn how to fit generalised linear models which allow us to model non-normal error distributions necessary for count data, for example.

Simulating data to enhance our understanding

When fitting models to data, one of the most important tools at our disposal is data simulation. It’s a very important, but often under-utilised skill. We can use R to simulate data that corresponds to what we expect to observe in the real world, given what we know about the data generating process.

Simulated data has a known truth. We know exactly what the relationship between \(x\) and \(y\) is because we define it mathematically when we do the simulation (we’ll do this in a sec).

Knowing the truth is very powerful when developing statistical models. We can directly test whether our model can reliably recover the known truth. If they can, then we can be reasonably confident that they will perform well on data collected in the real world (given our assumption regarding the data generating process are correct).

Simulation is also a great way to enhance our understanding of how models work. So let’s try it.

First we will simulate data to recreate figures A) and B) above. Then we will fit linear models to see if the can reliably estimate the known truth.

Continuous explanatory variable

As an example, we’ll consider our response variable of interest as tree circumference, and our explanatory variable to be fire severity. In the real world, we’ve made twenty independent measurements of tree circumference, and also measured fire severity at those locations.

So we will simulate a dataset of 100 tree circumference measurements and their corresponding fire severity, based on our understanding of the data generating process underlying tree circumference. (DISCLAIMER: I am not an expert in forest ecology, and so all of the below assumptions are just for illustrative purposes, and could very much be wrong).

Let’s say we know that:

  1. fire causes tree death and succession, resulting in smaller average tree circumference in locations with higher fire severity, and
  2. the relationship between fire severity and tree circumference is negative and linear (i.e., a one unit increase in fire severity will result in a decrease in tree circumference),
  3. tree circumference is a continuous variable, and the natural variation (error) in our measurements will be normally distributed with a mean of 0 and a standard deviation of 1.

Given the above, we think that we can use a general linear model to estimate the size of the effect (relationship) between tree circumference (\(y\)) and fire severity (\(x\)).

So, we will the equation for a linear model to simulate 20 tree circumference measurements (\(y\)):

\[y_i = b_0 + b_{1}X_{1i} + \epsilon_i.\] We’ll assume that, on average, the magnitude of the negative effect of fire severity on tree circumference is -1, meaning a one unit increase in fire severity results in a 1 unit decrease in tree circumference. This is the \(b_{1}\) coefficient in the above equation. When fitting models to real data, this is often the relationship that is unknown and we are trying to estimate.

We’ll also assume that, on average, in locations without fire (i.e., fire severity = 0), tree circumference is equal to 2. This is the \(b_0\) coefficient in the above equation and is usually unknown.

Fire severity (\(X\)) will be measured on a continuous scale between 0 to 1, where 0 = no fire and 1 = highest level of fire severity.

As mentioned above (C)), we expect natural variation in tree circumference measurements (i.e., variation not due to fire severity; \(\epsilon\)) to be normally distributed with a mean of 0 and standard deviation of 0.1.

Simulating data

First let’s set up our simulation parameters - they are constants in the linear model.

b0 <- 5 # average tree circumference when fire severity is = 0 (i.e, y-intercept)
b1 <- -1 # beta coefficient representing association between tree circumference and fire severity
e_mu <- 0 # mean error
e_sd <- 0.1 # error standard deviation

Great, now all we need to simulate tree circumference is fire severity measurements (\(X\)) and error (\(\epsilon\)). We need 100 of each (n = 100). We’ll draw fire severity measurements randomly from a uniform distribution with a minimum value of 0 and a max of 1.

n <- 100 # number of tree circumference measurements
X <- runif(n, min = 0, max = 1) # fire severity measurements
e <- rnorm(n, e_mu, e_sd) # natural error

Now we can simulate tree circumference using our linear equation: \(y_i = b_0 + b_{1}X_{1i} + \epsilon_i.\).

y <- b0 + b1*X + e # simulate tree circumference observations for each fire severity measurement
df <- data.frame(Fire_severity = X, Tree_circumference = y) # make a dataframe of observations
head(df)
  Fire_severity Tree_circumference
1   0.820333669           4.192305
2   0.522016990           4.501716
3   0.004953914           5.134029
4   0.355534569           4.579778
5   0.993639989           3.886823
6   0.170473480           4.710671

Looks good. Now let’s plot it

library(ggplot2)
ggplot(df) +
  aes(x = Fire_severity, y = Tree_circumference ) +
  geom_point() +
  ylab('Tree circumference') +
  xlab('Fire severity') +
  geom_smooth(method = 'lm') +
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'

Tip Do your numbers look slightly different from mine? To make our results reproducible, when ever we use a random number generator in R, we should set the seed.

set.seed(123)
X <- runif(n, min = 0, max = 1) # fire severity measurements
e <- rnorm(n, e_mu, e_sd) # natural error
y <- b0 + b1*X + e # simulate tree circumference observations for each fire severity measurement
df <- data.frame(Fire_severity = X, Tree_circumference = y) # make a dataframe of observations
# plot it
ggplot(df) +
  aes(x = Fire_severity, y = Tree_circumference ) +
  geom_point() +
  ylab('Tree circumference') +
  xlab('Fire severity') +
  geom_smooth(method = 'lm') +
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'

Fitting linear models to simulated data to recover known truths

Great, we have successfully simulated tree circumference observations! The relationship looks as we expected, with tree circumference decreasing, on average, with increasing fire severity (plus/minus some natural, random variation).

Now we want to know - can we accurately estimate the known truths, i.e., the y-intercept \(b_0\) and the beta coefficient \(b_{1}\), from our simulated data with a linear model?

m <- lm(Tree_circumference ~ Fire_severity, data = df)
m

Call:
lm(formula = Tree_circumference ~ Fire_severity, data = df)

Coefficients:
  (Intercept)  Fire_severity  
        4.999         -1.009  

Pretty close to our known truths. What if we have fewer observations though, or more?

How are the coefficients of the model (the intercept \(b_0\) and the beta coefficient \(b_{1}\) estimated? For a simple, general linear model like this, we (meaning R) uses Ordinarly Least Squares (OLS) to estimate the model coefficients. If you’re a visual learner like me, check out this great tool to see how OLS works. In a nutshell, we’re iteratively rotating the straight line between our y and x variables until we get the best fit (i.e. have minimised the sums of squared error).

Checking the model fit
summary(m)

Call:
lm(formula = Tree_circumference ~ Fire_severity, data = df)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.223797 -0.061323 -0.001973  0.059633  0.221723 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.99910    0.01961  254.99   <2e-16 ***
Fire_severity -1.00898    0.03418  -29.52   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09693 on 98 degrees of freedom
Multiple R-squared:  0.8989,    Adjusted R-squared:  0.8979 
F-statistic: 871.3 on 1 and 98 DF,  p-value: < 2.2e-16

Let’s unpack this model summary. [Talk through model output, if time write notes to explain..]

Without doing anything else, we already have one metric for assessing model fit - the R2. It tells us our model has explained a whopping 89% of the variability in the data (note, use the adjusted R^2 if you have more than one explanatory variable). That’s alot - the model is doing well! But remember this is not real data (we simulated it) - in ecology we usually see much lower R^2 values due to the natural sources of variability that we can’t explain.

Tip Try increased the standard deviation of the natural variation we’ve simulated (e_sd) and see how that changes the R2.

In addition to how much variability in the data our model is explaining, we also want to want to make sure the model isn’t violating any structural assumptions, namely that the model residuals are:

  1. Normally distributed, and have
  2. Homogeneity of variance.

What are residuals? There are different types of residuals, but generally speaking, residuals are the difference between model predictions and observations - e.g., how far are each of the individual data points from line fitted (predicted) by our model? Residuals therefore represent variability in our data that isn’t explained by the model. See here for further explanation.

We can use model diagnostic plots to check these assumptions:

par(mfrow = c(2,2))
plot(m)

These diagnostic plots show that our fitted model does not violate any structural model assumptions.

Let’s look at what happens if we try to fit a model that doesn’t fit the data well. Let’s simulate data with a non-linear (cubic) relationship between fire severity and tree circumference and try to fit a linear model to it.

y <- b0 + b1*X^3 + e # simulate tree circumference observations for each fire severity measurement
df <- data.frame(Fire_severity = X, Tree_circumference = y) # make a dataframe of observations
# plot it
ggplot(df) +
  aes(x = Fire_severity, y = Tree_circumference ) +
  geom_point() +
  ylab('Tree circumference') +
  xlab('Fire severity') +
  geom_smooth(method = 'lm', col = 'red') +
  geom_smooth(method = 'loess', col = 'blue') +
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

We can already see that assuming a linear relationship between fire severity and tree circumference (the red line) is unlikely to yield a model that fits the data well. But let’s try fitting the linear model anyway and seeing what the model diagnostics say about the model fit.

m <- lm(Tree_circumference ~ Fire_severity, data = df)
summary(m)

Call:
lm(formula = Tree_circumference ~ Fire_severity, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.40736 -0.08783  0.01631  0.09659  0.32965 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.20466    0.02820  184.56   <2e-16 ***
Fire_severity -0.91264    0.04917  -18.56   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1394 on 98 degrees of freedom
Multiple R-squared:  0.7785,    Adjusted R-squared:  0.7763 
F-statistic: 344.5 on 1 and 98 DF,  p-value: < 2.2e-16

Already we see the explanatory power of the model is lower than before (R2 of 78%). Note also that our estimates of the model coefficients are slightly biased compared to before.

Now let’s check the diagnostics.

par(mfrow = c(2,2))
plot(m)

The QQ plot looks ok, but what about the residuals vs. fitted? They residuals clearly to not exhibit homogeneity of variance. In fact, the u-shaped pattern in the residuals vs. fitted plot gives us a clue that the model is not fitting the data well because there is a non-linear relationship between the response and explanatory variable.

What can we do? We can try transforming the explanatory variable the same way we did when simulating the data to produce the non-linear relationship. If we transform the explanatory variable, the linear model still assumes a linear relationship between the response and the transformed explanatory variable.

m <- lm(Tree_circumference ~ I(Fire_severity^3), data = df)
summary(m)

Call:
lm(formula = Tree_circumference ~ I(Fire_severity^3), data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.22587 -0.06074  0.00113  0.05736  0.22499 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         4.99370    0.01292  386.55   <2e-16 ***
I(Fire_severity^3) -0.99621    0.03485  -28.59   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09696 on 98 degrees of freedom
Multiple R-squared:  0.8929,    Adjusted R-squared:  0.8918 
F-statistic: 817.1 on 1 and 98 DF,  p-value: < 2.2e-16

Note the use of I in the model formula - this means ‘as is’. It allows us to transform the explanatory variable within the model forumla. What do you notice about the coefficient estimates and R2?

Now the diagnostics.

par(mfrow = c(2,2))
plot(m)

Much better, no strong patterns in the residuals.

Sometimes, non-linear relationships cannot be adequately captured by a parametric, linear model. In this case, you might consider trying a generalised additive model. This is outside the scope of this coarse, but {mgcv} in R is excellent for this.

Categorical explanatory variable

Instead of a continuous measurement of fire severity, perhaps we only know whether our locations have been burned or are un-burned. Can we still use a linear model to estimate the difference in tree circumference (on average) between burned and un-burned locations?

Let’s simulate data with a categorical explanatory variable for fire (Burned vs. Un-burned), instead of a continuous measure of fire severity.

We will make all the same assumptions as before. Although our known truth for \(b_{1}\) is slightly different now that we have a categorical predictor, i.e., a \(b_{1}\) equal to -1 means that tree circumference, on average, is 1 unit lower at burned locations compared to un-burned locations.

Let’s simulate the data and plot it.

set.seed(123) # set random number generator for reproducibility

# simulation parameters
b0 <- 5 # average tree circumference when fire severity is = 0 (i.e, y-intercept)
b1 <- -1 # beta coefficient representing association between tree circumference and fire severity
e_mu <- 0 # mean error
e_sd <- 0.1 # error standard deviation

# simulation variables
n <- 100 # number of tree circumference measurements
X <- rep(c(0,1), n/2) # burned vs. un-burned
e <- rnorm(n, e_mu, e_sd) # natural error

# simulate y
y <- b0 + b1*X + e # simulate tree circumference observations for each fire severity measurement
df <- data.frame(Fire_severity = X, Tree_circumference = y) # make a dataframe of observations
 
#plot
ggplot(df) +
  aes(x = factor(Fire_severity), y = Tree_circumference) +
  geom_boxplot() +
  geom_jitter() +
  ylab('Tree circumference') +
  xlab('Fire severity') +
  geom_smooth(method = 'lm') +
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'

Note that the categories for fire severity are 0 and 1. The 0 corresponds to ‘Un-burned’ and the 1 corresponds to ‘Burned’ locations. We should (and easily can) label these as such for clarity, but I’ve purposefully left them as 0’s and 1’s here to illustrate something important about what linear models do when estimating coefficients for categorical variables.

Linear models dummy code categorical explanatory variables so that the reference level (in this case ‘Un-burned’) is equal to 0, and the non-reference level(s) (in this case ‘Burned’) are equal to 1. Let’s fit the model and look at the summary.

m <- lm(Tree_circumference ~ Fire_severity, data = df)
summary(m)

Call:
lm(formula = Tree_circumference ~ Fire_severity, data = df)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.237948 -0.057986 -0.000856  0.059773  0.209864 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.01105    0.01297  386.31   <2e-16 ***
Fire_severity -1.00402    0.01834  -54.73   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09172 on 98 degrees of freedom
Multiple R-squared:  0.9683,    Adjusted R-squared:  0.968 
F-statistic:  2995 on 1 and 98 DF,  p-value: < 2.2e-16

Great, so once again, we’ve been able to recover the known truth (an effect size of -1) with our linear model.

Note that if you have categorical variables with >2 levels, all of the ‘non-reference’ levels are dummy coded as 1’s. So, you end up estimating the effect of each level on the response, in reference to the ‘reference’ level. So the \(b_{1}\) coefficient of -1 means that, on average, tree circumference is one unit less at burned sites, in reference to un-burned sites.

As before, we would want to check our model meets structural assumptions using diagnostic plots.

par(mfrow = c(2,2))
plot(m)
hat values (leverages) are all = 0.02
 and there are no factor predictors; no plot no. 5

Looks great.

Fitting linear models to real data

As we’ve just seen, data simulation is a powerful tool for developing models and enhancing our understanding of how they work. Let’s try fitting some to real data. TODO

Generalised linear models

What if, in addition to tree circumference, we also wanted to ask whether species abundance is lower in sites with high fire severity?. Could we just fit the same model as before, and swap out circumference for abundance?

Remember that one of the assumptions of the linear models we’ve been using so far is that \(\epsilon\), the error term, is normally distributed. Continuous data like circumference conforms to this assumption. Counts of individuals (i.e., species’ abundance), however, can only be whole numbers - we can’t count half a bird - and is bounded from 0 to \(∞\). Residuals from models fitted to count data will therefore not be normally distributed.

So, we need to generalise our linear model a bit to accommodate this non-normal error structure. Generalised linear models allow us to accommodate many different types of non-normal error structures, including errors from count data. Once again, simulation can help us to understand how generalised linear models work. So let’s try it.

Simulating data with non-normal error structure

For our previous general linear model of tree circumference, we drew errors \(\epsilon\) from a normal distribution with a mean of 0. An alternative for count data is the poisson distribution.

The poisson is a probability distribution for discrete, whole numbers like count data, is bounded from 0 to \(∞\), and has one parameter \(lambda\) for the mean and variance (which are assumed to be equal). (Note, however, ecological count data often violates the latter assumption, in which case there are alternatives like the negative binomial).

n <- 1000
df <- data.frame(norm = rnorm(n, 1, 1), poiss = rpois(n, 1))

a <- ggplot(df) +
  aes(x = norm) +
  ylab('Frequency') +
  xlab('Random number') +
  ggtitle('A) Normal distribution') +
  geom_histogram(fill = 'cyan') +
  theme_classic()
b <- ggplot(df) +
  aes(x = poiss) +
  ylab('Frequency') +
  xlab('Random number') +
  ggtitle('B) Poisson distribution') +
  geom_histogram(fill = 'seagreen') +
  theme_classic()
a+b
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

So, we can use the poisson distribution to simulate data that mimics counts. But how can we ensure that the predictions of counts from our linear equation \(y_i = b_0 + b_{1}X_{1i}\) will be constrained between 0 and \(∞\) instead of \(-∞\) and \(∞\)?

This is where the link function comes in. Generalised linear models use link functions to transform model predictions from a linear equation so that they are bounded as necessary. For the poisson, the canonical link function is the log, meaning all predictions from a generalised linear model with a poisson family and log link will be between 0 and \(∞\).

Mathematically this looks like: \[log(\lambda_i) = b_0 + b_{1}X_{1i},\] Remember \(\lambda\) is the mean species abundance (and also the variance). And \(\lambda\) is what we need to simulate counts of abundance from a pooisson distribution,

where \[y_i \sim Poisson(\lambda_i).\]

In practice, we end up using the inverse link function to make predictions of mean species abundance for each level of the explanatory variable. The inverse of the log is the exponent. So, to predict mean species abundance \(\lambda\), we can take the exponent of the linear combination of the intercept and explanatory terms, like this: \[\lambda_i = exp(b_0 + b_{1}X_{1i}).\] We can then use these predicted values of \(\lambda\) to simulate counts of species abundance from a poisson distribution. Let’s simulate counts this way below.

Before we do though, one more thing to note. By taking the exponent of the linear combination of explanatory variable(s), we’re imposing a non-additive, multiplicative relationship between \(x\) and \(y\). This changes how we interpret the \(b_{1}\) coefficient. Instead of representing the additive change in \(y\) for every one unit increase in \(x\), \(b_{1}\) now represents the multiplicative change.

Here we go.

set.seed(123)
# All counts predicted by a poisson GLM will be on the log scale. 
# So here we log our 'known truths' for the intercept and beta coefficients (remembering that )
b0 <- log(10) # average species abundance when fire severity is = 0 (i.e, intercept) is 5 10
b1 <- -log(1/0.5) # species abundance decreases by a factor of 2 (aka 50% decrease in abundance) for a one unit increase in fire severity
n <- 100 # number of observations
X <- runif(n, min = 0, max = 1) # fire severity measurements
lambda <- exp(b0 + b1*X) # estimate mean species abundance (lambda) on multiplicative scale using the exponent
y <- rpois(n, lambda) # simulate species abundance observation from each lambda
df <- data.frame(Fire_severity = X, Species_abundance = y) # make a dataframe of observations

ggplot(df) +
  aes(x = Fire_severity, y = Species_abundance) +
  geom_point() +
  ylab('Species abundance') +
  xlab('Fire severity') +
  geom_smooth(method = 'lm') +
  #geom_smooth(method = 'loess') +
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'

Looks just as we would expect, only whole numbers for species abundance, and species abundance decreases by half for a one unit increase in fire severity (10*0.5 = 5).

Let’s see what happens when we try to fit a regular linear model (which assumes a linear relationship between \(x\) and \(y\) and a normal error structure).

m <- lm(Species_abundance ~ Fire_severity, data = df)
summary(m)

Call:
lm(formula = Species_abundance ~ Fire_severity, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8689 -1.6252 -0.4015  1.2453  6.7348 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    10.2042     0.4986  20.464  < 2e-16 ***
Fire_severity  -5.6446     0.8694  -6.493 3.49e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.465 on 98 degrees of freedom
Multiple R-squared:  0.3008,    Adjusted R-squared:  0.2936 
F-statistic: 42.16 on 1 and 98 DF,  p-value: 3.488e-09

Right away, we can see that the beta coefficient is biased from the known truth. Now check the diagnostics.

par(mfrow = c(2,2))
plot(m)

Not perfect, but actually not too bad given we know this is a mis-specified model. Let’s try a generalised linear model with a poisson family (error distribution) and a log link function instead.

m <- glm(Species_abundance ~ Fire_severity, data = df, family = poisson(link = 'log'))
summary(m)

Call:
glm(formula = Species_abundance ~ Fire_severity, family = poisson(link = "log"), 
    data = df)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    2.36214    0.06847   34.50  < 2e-16 ***
Fire_severity -0.77409    0.13211   -5.86 4.64e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 112.957  on 99  degrees of freedom
Residual deviance:  78.035  on 98  degrees of freedom
AIC: 460.49

Number of Fisher Scoring iterations: 4

Okay, so what’s happened here? The coefficients look worse than before?

The model has estimated the coefficients on the ‘link scale’, which is in this case the log scale. The inverse of the natural log is the exponent. Let’s back transform and see what we can recover our known truths.

exp(coef(m))
  (Intercept) Fire_severity 
   10.6136780     0.4611211 

Voila. Close to what we expected - a mean species abundance of 10 when fire severity is equal to 0 (the y intercept), and for this to be reduced by ~50% with a one unit increase in fire severity. (Tip: try upping the sample size (n) in the simulation and see if you get closer to the known truth.)

What else is unusual about this model summary? Now we have null and residual deviance instead of R2? How come? To estimate coefficients for generalised linear models we can no longer use ordinary least squares (OLS) like we did with general linear models. Instead we use maximum likelihood for parameter estimation which, in a nutshell, means we choose coefficients that maximise the log-likelihood of our model to the observed data.

A consequence of maximum likelihood parameter estimation is that we now measure model fit as deviance - which is the deviation of our fitted model from a perfect, ‘saturated’, model. So, the higher the residual deviance, the worse the model fit (a residual deviance of 0 implies a perfect model fit).

Null deviance is the deviations of an ‘intercept only’ worst model from a perfect model. We can estimate R2 for our glm as the one minus the ratio of residual deviance to null deviance - if it is 1 then our model is ‘perfect’, if it is 0 then our model is the same as the ‘worst’.

1 - (m$deviance/m$null.deviance) # calculate R2 - ratio of goodness of fit of our model compared to worst model possible
[1] 0.309162

Now check diagnostics. Similar to the mis-specified linear model. This highlights that it is important to rely on your knowledge of the data generating process when building models, not just verification tools.

par(mfrow = c(2,2))
plot(m)

Count data and overdispersion

Often, ecological count data is over-dispersed. This means there is more unexplained variation in the data than the model we’ve fitted expects. Remember, the poisson distribution assumes that the mean (average species abundance) equals the variance. But often in ecological data we see greater variance in abundance estimates for larger values. Furthermore, we often have a larger number of 0 counts than the poisson dictates, which can also lead to overdispersion. (Note that under-dispersion can also occur, but is less common in ecology.)

How can we check for over-dispersion? First, we can divide the residual deviance (the amount of left over variation we have in our data, after fitting the model) by the residual degrees of freedom (the amount of left over variation our model expects). It should be close to 1. If not, over-dispersion is likely a problem in our model fit.

m$deviance/m$df.residual
[1] 0.7962746

So overdispersion is not a problem here (and also we didn’t see anything indicative of overdispersion in the above diagnostic plots).

The rule of thumb is, if the ratio is >1.5, then overdispersion needs to be dealt with. We can try a negative binomial distribution instead of a poisson to deal with overdispersion. Alternatively, sometimes overdispersion is driven largely by an excess number of 0s in our count data (typical in ecology). In this case we might try a zero-inflated poisson or zero-inflated negative binomial.

We can also check the diagnostic plots. If we see a ‘fan-shape’ in the residuals vs. fitted plot, implying errors are greater when fitted (predicted) abundance is higher, we can suspect that our data is over-dispersed for the model we are trying to fit to it. Of coarse, we didn’t see overdispersion

Let’s simulate over-dispersed species abundance and see for ourselves. To simulate this data, we’ll use the negative binomial distribution instead of the poisson. In contrast to the poisson, the mean of a prediction is not expected to equal the variance. Instead, we can define the mean and variance separately, and use this to simulate overdispersed counts from the negative binomial distribution.

We’ll keep everything as before, but this time will use lambda at the mean counts to simulate from the negative binomial, setting variance (i.e., expected dispersion of counts) equal to 100.

set.seed(123)
df2 <- data.frame(df, Species_abundance_overdisp = rnbinom(n, mu = lambda, size = 1))

ggplot(df2) +
  aes(x = Fire_severity, y = Species_abundance_overdisp) +
  geom_point() +
  ylab('Species abundance (over-dispersed)') +
  xlab('Fire severity') +
  geom_smooth(method = 'lm') +
  #geom_smooth(method = 'loess') +
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'

Just as we expected, there is greater variability in species abundance than before. Now lets fit a glm assuming a poisson family (even though we know we should be using negative binomial) and check for overdispersion.

m <- glm(Species_abundance_overdisp ~ Fire_severity, data = df2, family = poisson(link = 'log'))
summary(m)

Call:
glm(formula = Species_abundance_overdisp ~ Fire_severity, family = poisson(link = "log"), 
    data = df2)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)     2.3327     0.0719  32.445  < 2e-16 ***
Fire_severity  -1.0572     0.1444  -7.324 2.41e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 685.21  on 99  degrees of freedom
Residual deviance: 629.94  on 98  degrees of freedom
AIC: 926.61

Number of Fisher Scoring iterations: 5

What do you notice right away about the model fit? The deviance is a lot higher than before.

m$deviance/m$df.residual
[1] 6.42794

The ratio is much greater than 1.5! The data is certainly over-dispersed for the poisson GLM (of course we expected this since we simulated the data from a negative binomial). Let’s just confirm with diagnostic plots too.

par(mfrow = c(2,2))
plot(m)

Not good. But, when dealing with high dispersion data (such as count data), we often wouldn’t expect residuals to be normally distributed anyway.

{DHARMa} offers a nice solution, simulating quantile residuals, which we would expect to be normally distributed if the glm is a good fit to the data.

library(DHARMa)
This is DHARMa 0.4.6. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa')
resid <- simulateResiduals(fittedModel = m, plot = F)
plot(resid)

The randomised quantile residuals confirm that overdispersion is a major problem in this model. So let’s try fitting a negative binomial glm.

library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:patchwork':

    area
m <- glm.nb(Species_abundance_overdisp ~ Fire_severity, data = df2, link = 'log')
summary(m)

Call:
glm.nb(formula = Species_abundance_overdisp ~ Fire_severity, 
    data = df2, link = "log", init.theta = 0.9443723276)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)     2.3156     0.2209  10.484  < 2e-16 ***
Fire_severity  -1.0200     0.3920  -2.602  0.00926 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.9444) family taken to be 1)

    Null deviance: 121.40  on 99  degrees of freedom
Residual deviance: 114.38  on 98  degrees of freedom
AIC: 583.47

Number of Fisher Scoring iterations: 1

              Theta:  0.944 
          Std. Err.:  0.162 

 2 x log-likelihood:  -577.472 

Summary indicates over-dispersion is no longer a problem. Let’s check our randomised quantile residuals to confirm this model is a good fit.

resid <- simulateResiduals(fittedModel = m, plot = F)
plot(resid)

Much better.

But, notice that the coefficient estimates of both the poisson and negative-binomial glms fitted to over-dispersed data were not biased. So why do we care? Well, notice the standard error of the coefficients is much lower than in the mis-specified model compared to the correct one. So, if we don’t deal with overdispersion, we could wind up incorrectily inferring there is a significant effect, when there isn’t.

Fitting generalised linear models to real data

Hopefully find is over-dispersed, suggest trying negative-binomial or zero inflated poisson

Resources

What we haven’t covered, and where you can learn more:

  • Interactive effects (i.e., the response of y to one predictor depends on another)

  • Random effects (i.e., mixed effects models) - most models in ecology should probably be mixed effects models! Random effects offer many benefits, including allowing you to account for lack of independence in your observations (e.g., site level effects, spatial and/or temporal autocorrelation)

  • Model selection - how to choose among competing, plausible models (e.g., models that include different combinations of explanatory variables)

Here we have learned how to fit models in a frequentist paradigm, using either ordinary least squares or maximum likelihood. As a follow-up here, we will learn how to fit models in a Bayesian paradigm to allow us to fit more complex multivariate models.